Abstract

Designing concurrent data structures should follow some basic rules. By separating the algorithms
into two phases, we present guidelines for scalable data structures, together with an analysis model
based on Amdahl’s law. To the best of our knowledge, we are the first to formalize a practical model
for measuring the speedup of concurrent structures. We also build several cutting-edge BSTs following
our principles and test them under different workloads. The results provide compelling evidence to
back our guidelines and show that our theory is useful for reasoning about the varied speedup.


1 Introduction 

As multi-core chips are widely used in commodity devices, designing concurrent structures has become a hot
topic. In recent years, researchers have focused on designing concurrent BSTs. Normally, a sequential BST
must be locked entirely when accessed by multiple threads. Concurrent BSTs leverage the property that
modifications naturally happen in disparate places, so using finer-grained locks or flags can boost
parallelism. There are several ways to improve the performance of concurrent BSTs. From the perspective
of the hardware interface, we can apply various atomic operations, such as compare-and-swap and
fetch-and-add. With the help of underlying system support, we can use RCU and STM. From the perspective
of structure, both external trees and internal trees are available. To achieve greater
disjoint-parallelism [5], the locks that previously resided on the nodes can be moved to the edges.

ASCYLIB is a concurrent structure library that includes a range of structures such as linked lists,
hash tables, and BSTs. The core idea of ASCY is that concurrent structures should resemble their
sequential counterparts. Its authors argue that structures following ASCY-compliant patterns consume
less power and achieve portable scalability, that is, they scale well across different workloads and
platforms.

In this paper, we adopt an idea similar to ASCY to implement different concurrent BSTs. We compare their
performance under various workloads and platforms, and propose our own principles for designing concurrent
structures. Based on Amdahl’s law [6], we present the first model to analyze speedup for concurrent
structures.


2 Preliminary 

We view the BST as a dictionary that stores unique key-value pairs. The dictionary supports three kinds
of operations, which we define as follows:

• Search(key). It calls the find operation to reach the corresponding leaf node, and returns true if the
key matches; otherwise it returns false.

• Insert(key). It first reaches the candidate leaf node. If the key already exists, it returns false;
otherwise it adds the key to the dictionary.

• Delete(key). It first reaches the candidate leaf node. If there is no such key, it returns false;
otherwise it removes the key from the dictionary.
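The three operations can be written down as a small Java interface. The sketch below is only an illustration: the names KeyDictionary and SynchronizedDictionary are hypothetical, keys are assumed to be ints, and a java.util.TreeSet stands in for a sequential BST. The synchronized wrapper corresponds to the coarse-grained idea of locking the whole structure on every access.

```java
import java.util.TreeSet;

/** Hypothetical interface name; the paper only names the three operations. */
interface KeyDictionary {
    boolean search(int key);   // true if the key is present
    boolean insert(int key);   // false if the key already exists
    boolean delete(int key);   // false if the key does not exist
}

/** Coarse-grained baseline: every operation holds one global monitor. */
final class SynchronizedDictionary implements KeyDictionary {
    // Assumption: a TreeSet stands in for a sequential BST.
    private final TreeSet<Integer> keys = new TreeSet<>();

    @Override
    public synchronized boolean search(int key) {
        return keys.contains(key);
    }

    @Override
    public synchronized boolean insert(int key) {
        return keys.add(key);      // add() returns false for duplicates
    }

    @Override
    public synchronized boolean delete(int key) {
        return keys.remove(key);   // remove() returns false if absent
    }
}
```

Because every method is synchronized on the same object, at most one thread makes progress at a time, which is the behavior the finer-grained variants try to avoid.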
To formalize the operations from the perspective of thread interactions, we propose an interface for our
design in Figure 1. Each operation takes a “snapshot” of the tree through the find routine, and then uses
a consistency controller to manage contention and trigger retries. A similar idea has been used to develop
a concurrent skiplist: to improve performance when modifications are few, the authors use simple locks
with some checks.

Figure 1: Operation Interface (diagram labels: Search(key), Update(key, value), Take Snapshots,
Consistency Control)

There are generally two types of structures in BSTs. One is the internal tree, which is the same as a
sequential BST; the other is the external tree, which uses more space but reduces contention to the leaf
nodes. Figure 2 shows the different structures: the external tree stores keys only in leaves, whereas the
internal tree stores keys in both internal nodes and leaves. The internal tree scales better as the number
of keys grows, whereas the external tree performs better when the key range is small. We use the external
tree to present our idea clearly, and we believe that the same technique can be applied to the internal
tree with little adaptation. Our FEM-BST, which uses flag and mark indicators, can also be adapted easily
into a lock-free version.

Figure 2: Two types of BSTs

To avoid special cases, we confine the key range to (−∞, ∞), as described in [3]. Figure 3 shows the
initial structure; the initial three nodes are guaranteed never to be removed. As illustrated in Table 1,
we implement 5 kinds of BSTs using different locks, all of which are CAS-based. Furthermore, we implement
different consistency controllers to check whether the state is consistent during a modification.
The ticket lock uses two numbers, ticket and version: the version records the current version of the node,
and the ticket is used for locking. If the current ticket is not equal to the version, the node is locked.
The flag-mark lock simply uses two boolean fields: the flag field indicates whether the node is owned by a
thread, and the marked field denotes whether the node is being deleted. In this paper, we introduce the
algorithm of FEM-BST. It places locks on nodes, but its performance is no different from that of formal
edge-based locks. We design a fine-grained checking mechanism to ensure correctness and improve
parallelism.

Figure 3: Initial structure of our BST
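The two lock styles described above can be sketched in Java with the standard atomic classes. This is a minimal illustration under assumptions, not the authors' implementation: the class names TicketVersionLock and FlagMarkLock and all method names are hypothetical, and the acquire and release protocol for the ticket lock is one plausible reading of "the node is locked when ticket is not equal to version".

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Ticket/version lock: the node counts as locked while ticket != version. */
final class TicketVersionLock {
    private final AtomicInteger ticket = new AtomicInteger(0);
    private volatile int version = 0;

    /** Best-effort check; the two reads are not atomic together. */
    boolean isLocked() {
        return ticket.get() != version;
    }

    /** Succeeds only if the lock is free at the observed version. */
    boolean tryLock() {
        int v = version;
        return ticket.compareAndSet(v, v + 1);
    }

    /** Called by the owner only: publishing the new version frees the lock. */
    void release() {
        version = ticket.get();
    }

    int version() {
        return version;
    }
}

/** Flag/mark lock: flag means "owned by a thread", marked means "being deleted". */
final class FlagMarkLock {
    private final AtomicBoolean flag = new AtomicBoolean(false);
    private volatile boolean marked = false;

    boolean tryLock() {
        return flag.compareAndSet(false, true);   // CAS-based acquire
    }

    void release() {
        flag.set(false);
    }

    boolean isLocked() {
        return flag.get();
    }

    boolean isMarked() {
        return marked;
    }

    void setMarked(boolean value) {
        marked = value;
    }
}
```

In this sketch the version increases by one on every release, so a reader that remembers the version it saw can later detect that the node was modified in between.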
Name       Lock
BST        none
SYN-BST    synchronized
FN-BST     flag-based lock on node
FE-BST     flag-based lock on edge
FEM-BST    flag-mark-based lock on edge
TN-BST     ticket lock on node

Table 1: The variant BSTs


3 Algorithm 

3.1 Search 

We use ppred to denote the grandparent node, pred to denote the parent node, and curr for the current
node. Moreover, pright and right record the directions taken from ppred and pred, respectively. As shown
in Algorithm 1 (Find), the operation goes from the root node to the corresponding leaf node and returns a
snapshot that includes 5 elements: {ppred, pright, pred, right, curr}, as in Figure 4. The search
algorithm is based on an optimistic strategy: it finds a node that was in the tree at some point along the
search path, rather than a node that is guaranteed to be in the tree currently. In fact, it is hardly
possible to implement an algorithm that returns the result at an exact timestamp.


Algorithm 1 Find

 1: curr ← root
 2: ppred, pred ← null
 3: while curr is not a leaf do
 4:     take snapshot
 5:     if key < curr.key then
 6:         curr ← curr.left
 7:     else
 8:         curr ← curr.right
 9:     end if
10: end while
11: return snapshot
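The traversal can be sketched in Java as follows. This is an illustrative reading rather than the authors' code: the class names SnapshotNode, Snapshot, and ExternalTreeFind are hypothetical, keys are assumed to be ints, and the descent goes left when the key is smaller than the routing key, following the ordering stated in the correctness section (left child smaller, right child greater or equal).

```java
/** External-tree node: internal nodes route, leaves hold keys. */
final class SnapshotNode {
    final int key;
    final boolean leaf;
    volatile SnapshotNode left, right;

    SnapshotNode(int key) {                       // leaf constructor
        this.key = key;
        this.leaf = true;
    }

    SnapshotNode(int key, SnapshotNode left, SnapshotNode right) {  // internal node
        this.key = key;
        this.leaf = false;
        this.left = left;
        this.right = right;
    }
}

/** The five elements returned by find. */
final class Snapshot {
    final SnapshotNode ppred, pred, curr;
    final boolean pright, right;

    Snapshot(SnapshotNode ppred, boolean pright, SnapshotNode pred,
             boolean right, SnapshotNode curr) {
        this.ppred = ppred;
        this.pright = pright;
        this.pred = pred;
        this.right = right;
        this.curr = curr;
    }
}

final class ExternalTreeFind {
    /** Walks from the root to a leaf without taking any lock. */
    static Snapshot find(SnapshotNode root, int key) {
        SnapshotNode ppred = null, pred = null, curr = root;
        boolean pright = false, right = false;
        while (!curr.leaf) {
            // Shift the window down one level before descending.
            ppred = pred;
            pright = right;
            pred = curr;
            right = key >= curr.key;              // right subtree holds keys >= routing key
            curr = right ? curr.right : curr.left;
        }
        return new Snapshot(ppred, pright, pred, right, curr);
    }
}
```

The snapshot is only a record of what the thread saw on its way down; the update operations have to re-validate it after locking, which is the role of the consistency controller.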



Figure 4: Take snapshot from top to down 


3.2 Insert 

To begin with, the insert operation gets the snapshot in line 2. It then compares the key with that of
the curr node. If the key is already in the tree, it returns false. Otherwise it tries to lock the curr
node and handles inconsistent situations. Figure 5 shows two such situations: one is that the pred node is
marked (line 9), which indicates that another operation is deleting the pred node; the other is that the
pred node is no longer linked to the curr node (line 13). Either situation means the operation has to
retry to find the new corresponding node. There is no need to retry if the parent node is only locked,
since this indicates that another node is being inserted on the other side, which does not affect the
current operation. Once the insert operation has locked the current node and passed these checks, it is
guaranteed to succeed. Finally, it constructs the new nodes, links them into the tree, and releases the
lock.


Algorithm 2 Insert

 1: while TRUE do
 2:     {curr, pred, right} ← find(key)
 3:     if curr.key == key then
 4:         return FALSE
 5:     end if
 6:     if !curr.tryLock() then
 7:         continue
 8:     end if
 9:     if pred.marked then                          ▷ parent node already deleted
10:         curr.release()
11:         continue
12:     end if
13:     if (right AND pred.right ≠ curr) OR (!right AND pred.left ≠ curr) then
14:         curr.release()
15:         continue
16:     end if
17:     Construct newParent and insertNode
18:     if right then                                ▷ according to right flag
19:         pred.right ← newParent
20:     else
21:         pred.left ← newParent
22:     end if
23:     curr.release()
24:     return TRUE
25: end while
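The insert steps can be sketched in Java as below. This is a hedged illustration, not the authors' implementation: the class names FemNode and FemInsertSketch are hypothetical, keys are assumed to be ints, the three-node sentinel layout (root with two sentinel leaves) is an assumption about the initial structure, and the choice of routing key for newParent (the larger of the two leaf keys) is also an assumption, since the paper only says "Construct newParent and insertNode". The marked field is never set here because this sketch contains no delete.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class FemNode {
    final int key;
    final boolean leaf;
    volatile FemNode left, right;
    volatile boolean marked;                       // set by a deleting thread
    private final AtomicBoolean flag = new AtomicBoolean(false);

    FemNode(int key) { this.key = key; this.leaf = true; }

    FemNode(int key, FemNode left, FemNode right) {
        this.key = key;
        this.leaf = false;
        this.left = left;
        this.right = right;
    }

    boolean tryLock() { return flag.compareAndSet(false, true); }
    void release() { flag.set(false); }
}

final class FemInsertSketch {
    // Assumed sentinel layout: user keys lie strictly between MIN_VALUE and MAX_VALUE.
    final FemNode root = new FemNode(Integer.MAX_VALUE,
            new FemNode(Integer.MIN_VALUE), new FemNode(Integer.MAX_VALUE));

    boolean search(int key) {
        FemNode curr = root;
        while (!curr.leaf) curr = key >= curr.key ? curr.right : curr.left;
        return curr.key == key;
    }

    boolean insert(int key) {
        while (true) {
            // Take the snapshot {curr, pred, right}.
            FemNode pred = null, curr = root;
            boolean right = false;
            while (!curr.leaf) {
                pred = curr;
                right = key >= curr.key;
                curr = right ? curr.right : curr.left;
            }
            if (curr.key == key) return false;     // key already present
            if (!curr.tryLock()) continue;         // contention: retry from the root
            // Consistency control: parent not being deleted and still linked to curr.
            if (pred.marked || (right ? pred.right : pred.left) != curr) {
                curr.release();
                continue;
            }
            FemNode leaf = new FemNode(key);
            FemNode newParent = key < curr.key
                    ? new FemNode(curr.key, leaf, curr)
                    : new FemNode(key, curr, leaf);
            if (right) pred.right = newParent; else pred.left = newParent;
            curr.release();
            return true;
        }
    }
}
```

Only the leaf that is about to be replaced is locked; the parent is merely inspected, which matches the observation that a locked but unmarked parent does not force a retry.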


3.3 Delete 

Like the insert operation, the delete operation starts by getting the snapshot (Algorithm 3, line 2).
However, it also needs the ppred node and the pright indicator, because it has to redirect the
grandparent’s link to the sibling of the removed node. It first compares the key with that of the current
node; if they are not equal, it returns false (line 4). Otherwise it tries to lock the pred node (line 7).
It then checks the state of the ppred node (line 11), ensuring that the grandparent is neither marked nor
linked to another node. If this check succeeds, it tries to lock the current node (line 16). Both locks
have to be released once an inconsistency is detected. After locking the parent node and the current node,
it has to wait for any operation on the sibling node to finish (lines 34-48). The delete operation
successfully removes the node from the tree once it obtains the correct sibling.

Figure 6 shows the two situations in which Algorithm 3 has to retry. Both happen at the grandparent
node. For situation (b), we first have to check whether the sibling node has been released. Once the
sibling node is released, it cannot be locked again while the parent node is locked, so we do not have to
retry from the root node.
Figure 5: Two wrong situations of insert operation 



Figure 6: Two wrong situations of delete operation 



Algorithm 3 Delete

 1: while TRUE do
 2:     {curr, pred, ppred, right, pright} ← find(key)
 3:     if curr.key ≠ key then
 4:         return FALSE
 5:     end if
 6:     Construct new node
 7:     if pred.marked OR !pred.tryLock() then                  ▷ check parent node
 8:         continue
 9:     else
10:         pred.marked ← TRUE
11:         if ppred.marked OR (pright AND ppred.right ≠ pred) OR (!pright AND ppred.left ≠ pred) then
                                                                ▷ check grandparent node
12:             pred.marked ← FALSE
13:             pred.release()
14:             continue
15:         end if
16:         if !curr.tryLock() then                             ▷ check curr node
17:             pred.marked ← FALSE
18:             pred.release()
19:             continue
20:         else
21:             curr.marked ← TRUE
22:             if (right AND pred.right ≠ curr) OR (!right AND pred.left ≠ curr) then
23:                 curr.marked ← FALSE
24:                 curr.release()
25:                 pred.marked ← FALSE
26:                 pred.release()
27:                 continue
28:             end if
29:             if right then                                   ▷ get sibling node
30:                 node ← pred.left
31:             else
32:                 node ← pred.right
33:             end if
34:             while TRUE do
35:                 if right then
36:                     if node.lock OR pred.left ≠ node then   ▷ The order cannot be changed
37:                         node ← pred.left
38:                         continue
39:                     end if
40:                     break
41:                 else
42:                     if node.lock OR pred.right ≠ node then  ▷ The order cannot be changed
43:                         node ← pred.right
44:                         continue
45:                     end if
46:                     break
47:                 end if
48:             end while
49:         end if
50:     end if
51:     if pright then
52:         ppred.right ← node
53:     else
54:         ppred.left ← node
55:     end if
56:     return TRUE
57: end while
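A compact Java sketch of the delete steps is given below. It is an interpretation under assumptions rather than the authors' code: the class names DelNode and FemDeleteSketch are hypothetical, keys are assumed to be ints, the tree is assumed to use sentinel leaves so that every removable leaf has a grandparent, and the sibling is always re-read from the side of pred opposite to curr. The wait on the sibling is a plain busy-wait.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class DelNode {
    final int key;
    final boolean leaf;
    volatile DelNode left, right;
    volatile boolean marked;
    private final AtomicBoolean flag = new AtomicBoolean(false);

    DelNode(int key) { this.key = key; this.leaf = true; }

    DelNode(int key, DelNode left, DelNode right) {
        this.key = key;
        this.leaf = false;
        this.left = left;
        this.right = right;
    }

    boolean tryLock() { return flag.compareAndSet(false, true); }
    void release() { flag.set(false); }
    boolean isLocked() { return flag.get(); }
}

final class FemDeleteSketch {
    final DelNode root;

    FemDeleteSketch(DelNode root) { this.root = root; }

    boolean delete(int key) {
        while (true) {
            // Take the snapshot {curr, pred, ppred, right, pright}.
            DelNode ppred = null, pred = null, curr = root;
            boolean pright = false, right = false;
            while (!curr.leaf) {
                ppred = pred; pright = right; pred = curr;
                right = key >= curr.key;
                curr = right ? curr.right : curr.left;
            }
            if (curr.key != key) return false;
            if (ppred == null) return false;       // assumed: sentinel leaves under the root stay
            if (pred.marked || !pred.tryLock()) continue;          // check parent node
            pred.marked = true;
            if (ppred.marked || (pright ? ppred.right : ppred.left) != pred) {
                pred.marked = false; pred.release(); continue;     // grandparent changed
            }
            if (!curr.tryLock()) {                                 // check curr node
                pred.marked = false; pred.release(); continue;
            }
            curr.marked = true;
            if ((right ? pred.right : pred.left) != curr) {
                curr.marked = false; curr.release();
                pred.marked = false; pred.release(); continue;
            }
            // Wait until no insert is in flight on the sibling, re-reading the link each time.
            DelNode sibling = right ? pred.left : pred.right;
            while (sibling.isLocked() || (right ? pred.left : pred.right) != sibling) {
                sibling = right ? pred.left : pred.right;
            }
            if (pright) ppred.right = sibling; else ppred.left = sibling;
            return true;                            // pred and curr stay locked and marked
        }
    }
}
```

The removed pred and curr nodes are left locked and marked on purpose in this sketch, so that any thread still holding a stale snapshot fails its consistency check and retries.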


4 Correctness 

We first prove that our FEM-BST maintains the BST property during executions, then prove that it is
deadlock-free, and finally point out the linearization points. The proof structure is similar to that of
prior work.

4.1 Maintaining Structure

We prove that the following invariants hold during the modification: 

1. The root node is never removed. 

2. The key field never changes. 

3. The left child’s key is always less than the parent node’s key, and the right child’s key is always
greater than or equal to the parent node’s key.

4. Once a parent node is locked and checked, the insert operation must succeed. 

5. Once the parent node and current node are locked and checked, the delete operation must succeed. 

Proof: 

1. The available key range is (−∞, ∞), therefore the initial three nodes will never be removed by
modifications.

2. An insert operation finishes by constructing two new nodes and linking them with the existing node; a
delete operation finishes by moving the grandparent pointer to the sibling of the removed node. Hence the
key field never changes.

3. Any modification of the BST takes a correct snapshot at a specific timestamp, hence the curr node
must be a child of the pred node. For an insert operation, the newly constructed node is placed in the
correct direction; for a delete operation, the grandparent’s pointer is redirected to a child that
already exists in the tree.

4. The insert operation first tries to lock the curr node. After locking, no other operation can act on
the curr node. It then checks whether the pred node is marked. If it is marked, the curr lock is
released and the operation retries to find a new corresponding node. Otherwise the insert operation will
succeed.

5. The delete operation locks the nodes in the order pred → curr. The locks on the pred node and the
curr node ensure that other threads can see the marked status and therefore do not affect the current
deletion. The sibling node will eventually be in a clean state, since there is a finite set of modify
operations.

4.2 Liveness

We prove that our FEM-BST is deadlock-free.

An important observation is that insert and delete operations lock nodes in top-down order. The insert
operation only locks the curr node. The delete operation first locks the pred node, and releases it once
it detects that the child is locked. Therefore, once contention is detected, the operation is rolled back.

Another source of contention in the delete operation arises when it tries to lock the sibling node.
Because there is a finite number of insert and delete operations, and every operation locks nodes in
top-down order, the locks held on pred and curr will not be violated. Eventually the delete operation
obtains the sibling in isolation. Thus the FEM-BST is deadlock-free.


4.3 Linearization Point

• Insert. The linearization point of a successful insert operation is at Algorithm 2, line 6; the
linearization point of a failed insert operation is at Algorithm 2, line 4.

• Delete. The linearization point of a successful delete operation is at Algorithm 3, lines 6 and 10;
the linearization point of a failed delete operation is at Algorithm 3, line 4.


5 Performance 

All of the BSTs in Table 1 are implemented in Java (JDK 1.7). We set up each test by randomly inserting
bucket elements into the tree, and then running the test threads for 5 s to ensure that elements are
inserted and deleted multiple times. A similar setup has been used in prior work. Experiments are
performed on a platform with two Intel E5-2680 processors, providing 32 hardware threads with
hyper-threading enabled. The operating system is Red Hat Enterprise Linux Server release 6.3.

We compare the different BSTs with respect to throughput, defined as the total number of operations
completed per second. The number of threads ranges from 1 to 32, and the bucket size is 10000 or 100000.
We use two workloads: low-contention (9% insert, 1% delete, 90% search) and mid-contention (20% insert,
10% delete, 70% search). Figure 7 shows the results under the different modification distributions. The
results show that our BSTs perform similarly to the unsynchronized BST, which stands for a possible upper
bound. The FN-BST is the least scalable one, since it needs to lock more nodes than the edge-based BSTs.
It is also worse than the TN-BST, which takes only a small snapshot during searching. Our FEM-BST has the
best performance, since it handles every possible inconsistency. We summarize our theory in the following
sections.

Figure 7: Comparison of throughput of different BSTs under varied workload
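A throughput measurement of this kind can be sketched as a small Java harness. The code below is an assumption-laden illustration, not the harness used for the reported numbers: the class name ThroughputSketch and its methods are hypothetical, the key range is assumed to be twice the bucket size, and a java.util.concurrent.ConcurrentSkipListSet stands in for the BSTs, which are not reproduced here.

```java
import java.util.Random;
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;
import java.util.concurrent.atomic.AtomicLong;

final class ThroughputSketch {
    /**
     * Pre-fills the set, runs the workers for roughly `millis` milliseconds,
     * and returns the number of completed operations per second.
     */
    static double run(Set<Integer> set, int threads, int bucket,
                      int insertPct, int deletePct, long millis) {
        int keyRange = 2 * bucket;                 // assumption: keys drawn from [0, 2 * bucket)
        Random fill = new Random(42);
        for (int i = 0; i < bucket; i++) set.add(fill.nextInt(keyRange));

        AtomicLong total = new AtomicLong();
        long deadline = System.nanoTime() + millis * 1_000_000L;
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            int id = t;
            workers[t] = new Thread(() -> {
                Random rnd = new Random(id);       // per-thread generator, no sharing
                long ops = 0;
                while (System.nanoTime() < deadline) {
                    int key = rnd.nextInt(keyRange);
                    int dice = rnd.nextInt(100);
                    if (dice < insertPct) set.add(key);
                    else if (dice < insertPct + deletePct) set.remove(key);
                    else set.contains(key);
                    ops++;
                }
                total.addAndGet(ops);
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return total.get() * 1000.0 / millis;
    }

    /** Low-contention mix from the text: 9% insert, 1% delete, 90% search. */
    static double lowContention(int threads, int bucket, long millis) {
        return run(new ConcurrentSkipListSet<>(), threads, bucket, 9, 1, millis);
    }
}
```

A call such as ThroughputSketch.lowContention(32, 10000, 5000) would mirror one configuration described above; the absolute numbers depend on the machine and the JVM, so only relative comparisons between structures are meaningful.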









6 Principles 

We propose some design principles for concurrent data structures, based on our experiments and the
interface in Figure 1:

1. To achieve correctness, we must ensure that either the snapshot or the consistency controller exists.

2. For general lock-based algorithms, the larger the snapshot, the less complex the consistency
controller.

3. For lock-free algorithms, both the snapshot and the consistency controller may be complex.

4. The snapshot and the consistency controller together determine single-thread performance. The higher
the single-thread performance, the lower the progress condition (parallelism).

Table 2 lists our analysis of common concurrent techniques. Synchronized stands for a coarse-grained
version of a sequential data structure, which handles contention by locking the whole object. STM exploits
parallelism through an optimistic control strategy, and it has to obtain a large snapshot. TicketLock,
mentioned above, implements the lock with version numbers. FairLock represents fine-grained locks that
use a queued structure. NonFairLock is implemented with flags. Lockfree algorithms usually need a very
detailed design of thread interactions. Waitfree algorithms have a stricter progress condition than
Lockfree ones; the only known wait-free structures are the queue [8] and the linked list.


Techniques     Parallelism   Snapshot    Consistency Controller
Synchronized   very low      very low    high
STM            low           medium      medium
TicketLock     medium        high        low
FairLock       low           low         high (AQS framework)
NonFairLock    medium        low         high
Lockfree       medium        medium      high
Waitfree       high          medium      very high

Table 2: Concurrent Techniques Comparison


7 Model 


To the best of our knowledge, there is no practical model suited to measuring the speedup of concurrent
data structures. Here we present an analysis model that adapts the original Amdahl’s law to concurrent
structures.

$$\mathrm{speedup} = \frac{1}{(1 - p) + p/P}$$

The above equation is the most commonly known form of Amdahl’s law, where $p$ is the parallel ratio of a
program. We assume $p = 1$ in concurrent structures, so the traditional model needs to be modified.
We start by comparing the workload of the sequential part ($w_s$) and the parallel part ($w_p$) of the
sequential structure with the concurrent structure, where the parallel workload ($W_p$) is different.

$$\mathrm{speedup} = \frac{w_s + w_p}{w_s + w_p/P}
  = \frac{w_p}{W_p(w_{snapshot}, w_{control}) \,/\, P(w_{snapshot}, w_{control})}$$


In the original equation, $P$ is defined as the number of processors. In a concurrent structure, however,
it is nearly impossible for all of the threads to be effective at once. Therefore we define
$P(w_{snapshot}, w_{control})$ as a function that represents the real parallelism. Furthermore, since new
operations, snapshot and consistency control, are involved in the parallel version, we define
$W_p(w_{snapshot}, w_{control})$ as the new parallel workload.

$$W_p(w_{snapshot}, w_{control}) = w_p + w_{snapshot} + w_{control}$$

$$P(w_{snapshot}, w_{control}) = P \cdot (1 - c) \cdot \alpha$$

$$0 < c < 1$$

Here $c$ stands for the contention rate, $\alpha$ is the rate of taking effect at linearization points,
and $\beta$ is the rate of recording valid linearization points. Therefore $c$ is an experiment-related
variable, while $\alpha$ and $\beta$ are algorithm-related variables.

Figure 8: $\alpha(t)$, with $1 < \mu$ (hardness of the structures)


$$\mathrm{speedup} = \frac{P \cdot (1 - c) \cdot \alpha}{1 + \dfrac{w_{snapshot} + w_{control}}{w_p}}$$

$$\frac{1}{w_{snapshot}} < \beta < 1$$

$$0 < \alpha < \frac{w_{snapshot} \cdot \beta}{w_{control}}$$


The $\alpha$ factor is associated with the hardness of the sequential structure, where we define hardness
as proportional to the amount of adjustment in the sequential part. Hence, the more an element has to
communicate with others, the harder the structure, which means more effort is needed to raise $\alpha$.
Figure 8 demonstrates our measure of the $\alpha$ factor. We have to put much more effort into “hard”
structures. For instance, for the heap, we have to lock the whole path from root to leaf during
modifications, so relaxing such an adjustment is difficult. A queue, however, only needs to modify the
tail and head, and it is therefore easier to raise its parallelism.
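To make the model concrete, the two speedup expressions can be evaluated numerically. The sketch below assumes the reading speedup = w_p / (W_p / (P(1 − c)α)) with W_p = w_p + w_snapshot + w_control and w_s = 0; the class name SpeedupModel, the method names, and the sample numbers are hypothetical and chosen only for illustration.

```java
final class SpeedupModel {
    /** Classical Amdahl's law: p is the parallel fraction, processors is P. */
    static double amdahl(double p, int processors) {
        return 1.0 / ((1.0 - p) + p / processors);
    }

    /**
     * Model for concurrent structures with no sequential part (w_s = 0):
     * the parallel workload grows by the snapshot and control costs, and the
     * effective parallelism shrinks to P * (1 - c) * alpha.
     */
    static double concurrent(int processors, double c, double alpha,
                             double wp, double wSnapshot, double wControl) {
        double effectiveParallelism = processors * (1.0 - c) * alpha;
        double parallelWorkload = wp + wSnapshot + wControl;
        return wp / (parallelWorkload / effectiveParallelism);
    }
}
```

For example, with 32 processors, c = 0.25, α = 0.5, w_p = 10, w_snapshot = 2, and w_control = 3, the effective parallelism is 12 and the workload is 15, so the modeled speedup is 10 / (15 / 12) = 8, well below the ideal value of 32 that the classical formula gives for p = 1.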


8 Conclusion 

We present a pattern for designing concurrent data structures, together with a model that formalizes the
measurement of speedup. We also provide compelling evidence by measuring different kinds of BSTs under
various workloads. An immediate direction for future work is to implement other structures, such as
skip-lists and heaps, to illustrate our model. Another is to refine the model so that it measures speedup
accurately, and to develop software for practical use.